Predicting Deposit Subscription via Telephone Marketing

Final Project - Kurnia Anwar Ra'if

Import Libraries

A. Data Understanding

Identify numerical and categorical data

Based on the results displayed above:

  1. Categorical Data : job, marital, education, default, housing, loan, contact, month, and poutcome have dtype object, so they are categorical variables.

  2. Numerical Data : age, balance, day, duration, campaign, pdays, and previous have dtype int64, so they are numerical variables.
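The dtype-based split above can be sketched with pandas `select_dtypes`. The tiny DataFrame here is a hypothetical stand-in for the bank-marketing data:

```python
import pandas as pd

# Tiny stand-in for the bank-marketing DataFrame (hypothetical values).
df = pd.DataFrame({
    "age": [30, 41, 52],
    "balance": [1200, -50, 300],
    "job": ["admin.", "technician", "management"],
    "marital": ["married", "single", "married"],
})

# Separate columns by dtype: object -> categorical, int/float -> numerical.
categorical_cols = df.select_dtypes(include="object").columns.tolist()
numerical_cols = df.select_dtypes(include="number").columns.tolist()
print(categorical_cols)  # ['job', 'marital']
print(numerical_cols)    # ['age', 'balance']
```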

Heatmap Correlation

Pearson Correlation Coefficient

From the heatmap above, no pair of features appears highly correlated: most absolute coefficients are below 0.5 (a coefficient of 1 means perfect correlation, 0 means none). Since the independent variables show no strong relationships with one another, multicollinearity is not a concern. Checking this before selecting variables for the regression model is good practice, as it is one of the steps to avoid overfitting. Rule of thumb for the correlation matrix: 0 <= |c| < 0.3 weak, 0.3 <= |c| < 0.7 moderate, 0.7 <= |c| <= 1 strong.
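A minimal sketch of the correlation check, using randomly generated columns as a stand-in for the real features (plotting with `seaborn.heatmap(corr, annot=True)` is left out):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for three numerical columns of the dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(18, 70, 100),
    "balance": rng.normal(1000, 500, 100),
    "duration": rng.integers(0, 600, 100),
})

# Pearson correlation matrix of the numerical columns.
corr = df.corr(method="pearson")
print(corr.round(2))

# Flag any off-diagonal pair with |r| >= 0.7 (strong correlation).
strong = (corr.abs() >= 0.7) & (corr.abs() < 1.0)
print("any strong pair:", strong.to_numpy().any())
```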

Exploratory Data Analysis : Categorical Data

Exploratory Data Analysis : Numerical Data

B. Data Preparation : Data Train

  1. Drop Duplicated Data
  2. Outliers Analysis
  3. Missing Value Handling
  4. Encoding Preparation

In this part, data preparation is done before modelling, so we apply these treatments to the training data.

1. Drop Duplicated Data

After dropping duplicates, no duplicated rows remain.

2. Outliers Analysis
Removing Outliers

The IQR method is used because the numerical features are not normally distributed.

remove outliers : previous column
remove outliers : pdays column
remove outliers : campaign column
remove outliers : duration column
remove outliers : day column
remove outliers : balance column
remove outliers : age column

Now there are no outliers in the training data.
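The per-column IQR filtering above can be sketched as a small helper function; the `campaign` values here are made up, with 999 as an obvious outlier:

```python
import pandas as pd

def remove_outliers_iqr(df, column):
    """Drop rows whose value in `column` lies outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return df[(df[column] >= lower) & (df[column] <= upper)]

# Stand-in data: 999 lies far outside the IQR fences.
train = pd.DataFrame({"campaign": [1, 2, 2, 3, 1, 2, 999]})
for col in ["campaign"]:          # on the real data: previous, pdays, ..., age
    train = remove_outliers_iqr(train, col)
print(len(train))  # 6 -> the outlier row was removed
```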

3. Missing Value Handling

Missing Value Check each rows

Missing Value check each columns

Missing value check NaN

From the output above, we see that several columns contain missing values. Below is a description of the columns with missing values:

balance: average yearly balance. Negative values occur because banks charge fees when an account is overdrawn, and a bank may close an account that stays negative for too long. pdays: number of days since the client was last contacted in a previous campaign (numeric; -1 means the client was not previously contacted).

Now we know which missing values each column contains. Next, we have to decide how to act on the columns with missing values: should we drop the column, or impute the missing values?

First, we check the percentage of missing values in each column. If a column contains many missing values (say, more than 35%), we can simply drop it; otherwise, we impute.

Check percentage of missing value in each column selection
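A sketch of the percentage check, with made-up missingness that mirrors the situation described below (poutcome and contact above 35%, job well below):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in: 10 rows with varying missingness per column.
df = pd.DataFrame({
    "contact": ["cellular", np.nan, np.nan, "telephone", np.nan,
                np.nan, "cellular", np.nan, "cellular", np.nan],
    "poutcome": [np.nan] * 8 + ["success", "failure"],
    "job": ["admin."] * 9 + [np.nan],
})

# Percentage of missing values per column, sorted descending.
missing_pct = df.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct)

# Columns above the 35% threshold are candidates for dropping.
to_drop = missing_pct[missing_pct > 35].index.tolist()
print(to_drop)  # ['poutcome', 'contact']
```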

So there are two columns with more than 35% missing values: contact and poutcome. However, the contact column is important to the analysis, because in this dataset the marketing channel is only cellular or telephone, so we keep it even though its missing percentage is near 35%. We therefore drop only the poutcome column and impute the other affected columns.

imputation data for : job, education, contact columns
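A minimal sketch of the training-set imputation, assuming mode imputation (consistent with the `mode_job` / `mode_education` / `mode_contact` references later in this notebook); the data here is made up:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the training data after dropping poutcome.
train = pd.DataFrame({
    "job": ["admin.", "admin.", np.nan, "technician"],
    "education": ["secondary", np.nan, "secondary", "tertiary"],
    "contact": [np.nan, "cellular", "cellular", "telephone"],
})

# Impute each column with its mode, and remember the modes so the
# same values can be reused on the test set later.
train_modes = {}
for col in ["job", "education", "contact"]:
    train_modes[col] = train[col].mode()[0]
    train[col] = train[col].fillna(train_modes[col])

print(train_modes)
print(train.isna().sum().sum())  # 0 -> no missing values remain
```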

Check again for each column

After handling the missing values in the training data (imputation and dropping poutcome), we apply the same treatment to the test data, injecting the imputation values learned from the training data into the test data. This is needed so we can make predictions on the test data after modelling. The data preparation for the test set therefore mirrors the training set, but the imputation always references the treatment of the training data.

C. Data Preparation : Data Test

  1. Drop Duplicated Data
  2. Outliers Analysis
  3. Missing Value Handling
  4. Encoding Preparation

In this part, data preparation is done before modelling, so we apply the same treatments to the test data.

1. Drop Duplicated Data

After dropping duplicates, no duplicated rows remain in the test data. The poutcome feature is dropped here as well, because it was dropped from the training data.

2. Remove Outliers

Using IQR

remove outliers : previous column
remove outliers : pdays column
remove outliers : campaign column
remove outliers : duration column
remove outliers : day column
remove outliers : balance column
remove outliers : age column
Now there are no outliers in the test data

3. Missing Value Handling

Missing Value Check each rows

Missing Value check each columns

From the output above, we see that several columns also contain missing values in the test data. Below is a description of the columns with missing values:

The poutcome column has already been dropped. balance: average yearly balance. Negative values occur because banks charge fees when an account is overdrawn, and a bank may close an account that stays negative for too long. pdays: number of days since the client was last contacted in a previous campaign (numeric; -1 means the client was not previously contacted).

Now we know which missing values each column contains. Next, we have to decide how to act on the columns with missing values: should we drop the column, or impute the missing values?

First, we check the percentage of missing values in each column. If a column contains many missing values (say, more than 35%), we can simply drop it; otherwise, we impute.

Check percentage of missing value in each column selection

The contact column is important to the analysis, because in this dataset the marketing channel is only cellular or telephone, so we keep it even though its missing percentage is near 35%. The poutcome column has already been dropped, and the remaining affected columns are imputed.

Impute the job, education, and contact columns, using the training data as the reference for the imputation values.

mode_job : mode of job in the training data

mode_education : mode of education in the training data

mode_contact : mode of contact in the training data
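A sketch of reusing the training-set modes on the test set; the mode values and rows here are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical modes computed on the TRAINING data.
mode_job, mode_education, mode_contact = "blue-collar", "secondary", "cellular"

# Hypothetical test rows with gaps.
test = pd.DataFrame({
    "job": [np.nan, "services"],
    "education": ["primary", np.nan],
    "contact": [np.nan, np.nan],
})

# Fill test-set gaps with the train modes, never with test statistics,
# so no information leaks from the test set into the preparation step.
test["job"] = test["job"].fillna(mode_job)
test["education"] = test["education"].fillna(mode_education)
test["contact"] = test["contact"].fillna(mode_contact)
print(test.isna().sum().sum())  # 0 -> no missing values remain
```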

Check again for each column

Check for test data and train data

Split into Xtrain, ytrain, Xtest, ytest

Chi-Square Test

The chi-square test is carried out to investigate the dependency between each feature x and the categorical target y. Based on the results below, the attributes job, marital, education, default, housing, loan, contact, month, poutcome, age, balance, duration, day, campaign, pdays, and previous are not independent of subscribed; we reject the null hypothesis and accept H1.
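A worked sketch of the test on one feature, with a made-up contingency table (contact channel vs. subscription); the notebook presumably uses `scipy.stats.chi2_contingency`, but the statistic is computed by hand here to keep the example dependency-free:

```python
import numpy as np

# Hypothetical observed counts: rows = cellular/telephone, cols = yes/no.
observed = np.array([[30, 10],
                     [5, 35]])

# Expected counts under H0 (independence): outer product of the margins.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
grand = observed.sum()
expected = row_totals @ col_totals / grand

# Chi-square statistic: sum of (O - E)^2 / E over all cells.
chi2 = ((observed - expected) ** 2 / expected).sum()
print(round(chi2, 2))

# For df = 1, the 5% critical value is 3.841; chi2 above it -> reject H0.
print(chi2 > 3.841)
```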

4. Encoding Preparation : Data Train and Data Test

The sklearn LabelEncoder is used to encode the categorical variables: job, marital, education, contact, default, housing, loan, subscribed, and month.

Encode separately for data train and data test

Let's check it

After performing the label encoding operation on both the train and test datasets, the categorical values are converted to numerical values.
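A minimal sketch of encoding one column with a shared encoder; fitting on train and reusing the encoder on test keeps the category-to-integer mapping identical in both sets (the values here are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

train = pd.DataFrame({"marital": ["married", "single", "divorced", "married"]})
test = pd.DataFrame({"marital": ["single", "divorced"]})

# Fit on train, then apply the SAME fitted encoder to test.
le = LabelEncoder()
train["marital"] = le.fit_transform(train["marital"])
test["marital"] = le.transform(test["marital"])

# LabelEncoder sorts classes alphabetically: divorced=0, married=1, single=2.
print(train["marital"].tolist())  # [1, 2, 0, 1]
print(test["marital"].tolist())   # [2, 0]
```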

Split Xtrain and Xtest into different categories : numerical and categorical

The code below is used to split Xtrain into two categories: one consisting of the categorical data and the other consisting of the numerical data.

Standard Scaler / Feature Scaling

Since the table contains continuous numerical data, StandardScaler is used to scale each input variable separately by subtracting the mean and dividing by the standard deviation, giving each feature a distribution with zero mean and unit standard deviation.
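A sketch of the scaling step; the statistics are fitted on the training data only and then applied unchanged to the test data (the arrays here are made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical numerical features, e.g. (age, balance).
X_train = np.array([[18.0, 100.0], [30.0, 500.0], [60.0, 900.0]])
X_test = np.array([[25.0, 300.0]])

# Fit mean/std on train only; reuse the same transform on test.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0).round(6))  # ~0 per column
print(X_train_scaled.std(axis=0).round(6))   # ~1 per column
```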

D. Modelling

1. Unsupervised Learning : PCA

2. Supervised Learning Using Classification for modelling

  1. Logistic Regression
  2. K-Nearest Neighbours (KNN)
  3. Naive Bayes
  4. Support Vector Machine (SVM)

1. Logistic Regression

For this model, a logistic regression classifier is implemented. The hyperparameters are tuned using GridSearchCV, and the model is then fitted to the training data.
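A sketch of the tuning-and-fitting step on a synthetic stand-in dataset; the parameter grid over the regularisation strength C is an assumption, not the notebook's exact grid:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the prepared bank data.
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Tune the inverse-regularisation strength C with 5-fold cross-validation.
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    cv=5,
)
grid.fit(X_train, y_train)

# Predict on the test set and report accuracy.
accuracy = grid.score(X_test, y_test)
print(grid.best_params_, round(accuracy, 3))
```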

Predict the test set results and calculate the accuracy.

2. K-Nearest Neighbours (KNN)

KNN is a non-parametric, lazy (instance-based) algorithm: it has no specialised training phase. In this section, grid search is used to find the k value that gives the best accuracy. In this case, a k value of 14 provides the highest accuracy score.
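A sketch of the grid search over k on synthetic data (on this stand-in data the best k will differ from the k = 14 found on the real dataset):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the prepared bank data.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Search k = 1..20 with 5-fold cross-validation.
grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(1, 21))},
    cv=5,
)
grid.fit(X_train, y_train)
print(grid.best_params_["n_neighbors"], round(grid.score(X_test, y_test), 3))
```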

Predicting the test and train results

3. Naive Bayes

Naive Bayes is a simple probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions between the features (X), and it is useful for very large datasets. In this section, GaussianNB is imported from sklearn, and Xtrain and ytrain are fitted to the model in order to make predictions.
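A minimal sketch of the GaussianNB fit-and-predict step on a synthetic stand-in dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the prepared bank data.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Fit the training sets to the model, then predict on the test set.
model = GaussianNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
acc = model.score(X_test, y_test)
print(round(acc, 3))
```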

Fit the training sets to the model.

Predict on the test set and the training set

4. Support Vector Machine (SVM)

A linear SVM was chosen: the SVC classifier makes predictions on the Xtest dataset after the Xtrain and ytrain datasets are fitted into the SVM model. Accuracy was calculated and displayed using the score method.
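A sketch of the linear SVC step on a synthetic stand-in dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the prepared bank data.
X, y = make_classification(n_samples=300, n_features=8, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)

# Linear-kernel SVC, as described above.
model = SVC(kernel="linear")
model.fit(X_train, y_train)
acc = model.score(X_test, y_test)
print(round(acc, 3))
```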

E. Evaluation Model

The evaluation methods used involve the confusion matrix, the precision-recall curve, and the learning curve.

  1. Learning Curve
  2. Model Evaluation
  3. Precision Recall Curve
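The confusion matrix and precision-recall curve can be sketched on a small made-up set of labels and scores:

```python
from sklearn.metrics import confusion_matrix, precision_recall_curve

# Hypothetical true labels and predicted probabilities for class 1.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.6]
y_pred = [1 if s >= 0.5 else 0 for s in y_score]

# Confusion matrix layout: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Precision/recall at every score threshold, for the PR curve.
precision, recall, thresholds = precision_recall_curve(y_true, y_score)
print(precision.round(2), recall.round(2))
```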

1. Learning Curve

Function Learning Curve
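A sketch of such a function using sklearn's `learning_curve` on a synthetic stand-in dataset; plotting the mean train and validation scores against the training-set sizes reveals over- or under-fitting:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the prepared bank data.
X, y = make_classification(n_samples=300, n_features=8, random_state=3)

# Train/validation scores at 5 increasing training-set sizes, 5-fold CV.
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
)
print(sizes)
print(train_scores.mean(axis=1).round(3))
print(val_scores.mean(axis=1).round(3))
```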

Learning Curves : Logistic Regression
Learning Curves : KNN
Learning Curves : Naive Bayes
Learning Curves : SVM

2. Model Evaluation

Model Evaluation Function

Logistic Regression

KNN

SVM

Naive Bayes